Data Preparation for a Sentiment Analysis Model


by Green AI Innovations

Introduction:

This document presents an Exploratory Data Analysis performed on the data collected for one of the models in the proposed pipeline: the Sentiment Analysis Model.


Emotion detection from text is one of the more challenging problems in Natural Language Processing.
The main reasons are the scarcity of labelled datasets and the multi-class nature of the problem.
Humans express a wide variety of emotions, and it is difficult to collect enough records for each one, so the problem of class imbalance arises.

Data preparation is a crucial step in natural language processing (NLP) as it can significantly impact the performance of NLP models.
Here are some reasons why data preparation is important for NLP models:
  • Quality of the data: NLP models require high-quality data to perform accurately. Data preparation involves cleaning, preprocessing, and transforming raw data into a format suitable for NLP models.

This process ensures that the data used to train the model is of high quality and free from errors or inconsistencies.

  • Size of the data: NLP models require a large amount of data to achieve good performance. Data preparation involves collecting, processing, and organizing large volumes of text data to create a large corpus.

Generally, the larger and cleaner the corpus, the better the NLP model tends to perform.

  • Feature engineering: NLP models require carefully engineered features that capture the nuances of natural language.

Data preparation involves selecting and engineering the appropriate features, such as n-grams, word embeddings, and syntactic features, to improve the performance of the NLP model.

  • Model training: NLP models require extensive training to learn the patterns and relationships in the data.

Data preparation involves splitting the data into training, validation, and testing sets to ensure that the model is trained on a representative sample of the data.

In summary, data preparation is a critical step in NLP as it ensures that the model is trained on high-quality, representative data and that the features are carefully engineered to capture the nuances of natural language.
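As a small illustration of the last point, a stratified train/validation/test split can be produced with scikit-learn's train_test_split (which this notebook imports later). This is a minimal sketch on a hypothetical toy frame, not the notebook's actual split; the column names `text` and `label` are assumptions for the example:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for a prepared dataset (60 rows, 2 classes)
df = pd.DataFrame({
    "text": ["great day", "so sad", "love this", "terrible", "fine", "awful"] * 10,
    "label": [1, 0, 1, 0, 1, 0] * 10,
})

# Hold out 20% for testing; stratify so class proportions are preserved
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

# Split the remainder again to obtain a validation set (0.25 of 80% = 20% overall)
train_df, val_df = train_test_split(
    train_df, test_size=0.25, stratify=train_df["label"], random_state=42
)

print(len(train_df), len(val_df), len(test_df))  # 36 12 12
```

Stratifying on the label column keeps the class ratios identical across all three sets, which matters for the imbalanced emotion classes discussed above.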


Dataset:

The datasets collected during data sourcing were already initially prepared during the EDA, which is described in a previous file.
In this notebook we load that pre-prepared data and continue to pre-process it for modelling in a later phase.

Planning:

  1. Loading libraries & Data

  2. Encoding classes

  3. Plotting class distribution

  4. Balancing classes

  5. Categorizing classes

  6. Pre-Processing text features

  7. Saving data file

  8. Conclusion




1. Loading libraries & Data:


In [ ]:
# NumPy is a library for numerical computing in Python, providing a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays.
import numpy as np

# Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and data manipulation library for Python. It provides data structures for efficiently storing and manipulating large and complex data sets.
import pandas as pd

# Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.
import matplotlib.pyplot as plt
from matplotlib import rc

# Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression, and clustering algorithms.
from sklearn.model_selection import train_test_split

# Plotly Express is a high-level data visualization library for Python.
import plotly.express as px

# Natural Language Toolkit (NLTK) is a leading platform for building Python programs to work with human language data. 
# It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, 
# along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, semantic reasoning, and wrappers for industrial-strength NLP libraries.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# The stopword list and WordNet lemmatizer rely on NLTK corpora; download them once if not already present
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

# Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.
from bs4 import BeautifulSoup

# String is a Python library that contains a set of string constants, including the punctuation characters.
import string

# Make plot interactive in HTML format
import plotly.io as pio
pio.renderers.default='notebook'
In [ ]:
# Importing data from EDA output (initially prepared for EDA)
data = pd.read_csv('SA_model_data_after_EDA')

# Dropping dummy column
data = data.drop('Unnamed: 0', axis=1)

# Printing loaded df
data
Out[ ]:
emotion text needed_model num_words text_length
0 anger point today someone says something remotely ki... yes 10 69
1 anger game day minus 14 30 relentless yes 6 31
2 anger game pissed game year blood boiling time turn ... yes 10 58
3 anger found candice candace pout likes yes 5 32
4 anger cannot come muma 60th 25k tweets soreloser yes 7 42
... ... ... ... ... ...
66009 happiness succesfully following tayla yes 3 27
66010 love happy mothers day love yes 4 22
66011 love happy mothers day mommies woman man long momma... yes 10 58
66012 happiness wassup beautiful follow me peep new hit single... yes 13 73
66013 love bullet train tokyo gf visiting japan since thu... yes 12 88

66014 rows × 5 columns

2. Encoding classes:


Not all emotion classes required for the pipeline are available in the gathered data, so we rename some of the additional ones to fit our use case.
This has a low impact on the final solution.

In [ ]:
# Renaming neutral class to rq - rhetorical question
data.loc[data["emotion"] == "neutral", "emotion"] = "rq"

# Flagging the renamed class as needed for the model
data.loc[data["emotion"] == "rq", "needed_model"] = "yes"

# Renaming hate class to anger
data.loc[data["emotion"] == "hate", "emotion"] = "anger"

# Flagging the merged class as needed for the model
data.loc[data["emotion"] == "anger", "needed_model"] = "yes"

3. Plotting class distribution:


Now, we take a look at the distribution of classes and their number of observations.
To obtain an unbiased model, the training data classes need to have similar counts.

In [ ]:
# Creating a subset of dataframe for plotting
df = data.groupby(['emotion', 'needed_model'])['emotion'].count().reset_index(name="count").sort_values('count', ascending=False)

# Calculating the mean number of observations per class
avg = data["emotion"].value_counts().values.mean()

# Initiating the bar chart
fig = px.bar(df,    # Dataframe
            y='count',  # y value
            x='emotion',    # x value
            text_auto='.2s',    # text for labels
            color= 'needed_model', # plot color palette
            title="Class distribution in the dataframe") # title for the chart

# Adding a horizontal "target" line
fig.add_shape(type="line",  # selecting type
              line_color="salmon",  # selecting color
              line_width=3, # selecting width
              opacity=1, # selecting opacity
              line_dash="dot", # dash of line
              x0=0, # selecting position, start and end point, values at start and end
              x1=1, 
              xref="paper", 
              y0=avg, 
              y1=avg, 
              yref="y", 
)

# Adding plot legend and renaming x,y labels
fig.update_layout(legend_title_text='Is the class needed for model?', # legend label
                  title="Class distribution in the dataframe", # chart title 
                  xaxis_title="Class label", # x label
                  yaxis_title="Class count", # ylabel
                  font=dict(family="Courier New, monospace", # font settings
                            size=18,
                            color="black")
)

# Adding annotation to the mean line 
fig.add_annotation(xref="paper", # selecting style
                   x=0.98, # selecting x and y positions
                   y=5500,
                   text="Average count per class: "+str(round(avg,1)), # annotating with the avg value
                   showarrow=False # disabling the arrow
)

# Updating layout style
fig.update_layout(barmode="group", 
                  clickmode="event+select", 
                  xaxis_tickangle=-45)

# Selecting settings for legend
fig.update_layout(height=1000, # changing height of the whole vis
                  legend=dict(x=0.85, # selecting legend position
                              y=1.3,
                              traceorder="reversed",
                              title_font_family="Courier New, monospace", # font settings
                              font=dict(family="Courier",
                                        size=12, 
                                        color="black"),
                              bgcolor="white", # color settings
                              bordercolor="navy",
                              borderwidth=4)
)

# Finally, showing the figure
fig.show()

4. Balancing classes:


The classes needed for modelling are not balanced, so we need to balance them manually.
First, we take a more detailed look at the distribution.
Then we trim every class down to a common size.

In [ ]:
# Filter the data to only include the needed model columns for the emotion and text, and the rows where the 'needed_model' value is 'yes'.
data = data[data['needed_model']=='yes'][['emotion', 'text']]

# Find the unique classes (emotions) in the dataset and count the number of classes.
classes=sorted(list(data['emotion'].unique()))
class_count = len(classes)

# Print the number of classes in the dataset.
print('The number of classes in the dataset is: ', class_count)
print('')

# Group the data by emotion to get the count of each class.
groups=data.groupby('emotion')
print('{0:^30s} {1:^13s}'.format('CLASS', 'VALUE COUNT'))
countlist=[]
classlist=[]
for label in sorted(list(data['emotion'].unique())):
    group=groups.get_group(label)
    countlist.append(len(group))
    classlist.append(label)
    print('{0:^30s} {1:^13s}'.format(label, str(len(group))))

# Get the classes with the minimum and maximum number of examples.
max_value=np.max(countlist)
max_index=countlist.index(max_value)
max_class=classlist[max_index]
min_value=np.min(countlist)
min_index=countlist.index(min_value)
min_class=classlist[min_index]

# Print results
print('')
print(max_class, ' has the most examples= ', max_value, ' ', min_class, ' has the least examples= ', min_value)
The number of classes in the dataset is:  7

            CLASS               VALUE COUNT 
            anger                  5792     
             fear                  4600     
          happiness                13400    
             love                  5314     
              rq                   8271     
           sadness                 12369    
            worry                  8347     

happiness  has the most examples=  13400   fear  has the least examples=  4600
In [ ]:
# This function takes a Pandas DataFrame (df), a maximum number of samples to keep per class (max_samples), a minimum number of samples to keep per class (min_samples), and the column in the DataFrame that contains the class labels (column).
def trim(df, max_samples, min_samples, column):
    # A copy of the input DataFrame is made.
    df=df.copy()

    # The DataFrame is grouped by the column containing the class labels.
    groups=df.groupby(column)    

    # A new empty DataFrame is created to store the trimmed data.
    trimmed_df = pd.DataFrame(columns = df.columns)

    # The data is grouped by class, and for each class:
    for label in df[column].unique(): 
        
        # Get the group for the current label.
        group=groups.get_group(label)
        
        # Get the number of samples in the current group.
        count=len(group)    
        
        # If the number of samples in the current group is greater than the maximum number of samples allowed, 
        # randomly sample the group to the maximum number of samples, and add the result to the trimmed DataFrame.
        if count > max_samples:
            sampled_group=group.sample(n=max_samples, random_state=123,axis=0)
            trimmed_df=pd.concat([trimmed_df, sampled_group], axis=0)
        
        # If the number of samples in the current group is between the minimum and maximum,
        # add the entire group to the trimmed DataFrame.
        else:
            if count>=min_samples:
                sampled_group=group        
                trimmed_df=pd.concat([trimmed_df, sampled_group], axis=0)

    # Print a message to indicate the maximum and minimum number of samples after trimming.
    print('After trimming, the maximum samples in any class is now ',max_samples, ' and the minimum samples in any class is ', min_samples)

    # Return the trimmed DataFrame.
    return trimmed_df
In [ ]:
# Parameter for max samples
max_samples=4000 

# Parameter for min samples
min_samples=4000

# Parameter for column to 'resize'
column='emotion'

# Calling function on dataset
data = trim(data, max_samples, min_samples, column)

# Find the unique classes (emotions) in the dataset and count the number of classes.
classes=sorted(list(data['emotion'].unique()))
class_count = len(classes)

# Print the number of classes in the dataset.
print('The number of classes in the dataset is: ', class_count)
print('')

# Group the data by emotion to get the count of each class.
groups=data.groupby('emotion')
print('{0:^30s} {1:^13s}'.format('CLASS', 'VALUE COUNT'))
countlist=[]
classlist=[]
for label in sorted(list(data['emotion'].unique())):
    group=groups.get_group(label)
    countlist.append(len(group))
    classlist.append(label)
    print('{0:^30s} {1:^13s}'.format(label, str(len(group))))
After trimming, the maximum samples in any class is now  4000  and the minimum samples in any class is  4000
The number of classes in the dataset is:  7

            CLASS               VALUE COUNT 
            anger                  4000     
             fear                  4000     
          happiness                4000     
             love                  4000     
              rq                   4000     
           sadness                 4000     
            worry                  4000     

5. Categorizing classes:


NLP models require labels encoded as numbers rather than strings.
In this step, we create a simple encoder/decoder that maps the text class labels to integers.

In [ ]:
data['label'] = pd.Categorical(data['emotion']).codes

# Building the label mapping: each emotion name paired with its integer code
test_keys = pd.Categorical(data['emotion']).categories
test_values = pd.Categorical(data['label']).categories.values.tolist()
decoder = dict(zip(test_keys, test_values))

data = data[['label', 'text']]

decoder
Out[ ]:
{'anger': 0,
 'fear': 1,
 'happiness': 2,
 'love': 3,
 'rq': 4,
 'sadness': 5,
 'worry': 6}
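To map model predictions back to emotion names later, the mapping above can simply be inverted. A minimal sketch, assuming the `decoder` dict shown here (the name `inv_decoder` is an assumption for the example):

```python
# The mapping produced above (emotion name -> integer code)
decoder = {'anger': 0, 'fear': 1, 'happiness': 2, 'love': 3,
           'rq': 4, 'sadness': 5, 'worry': 6}

# Invert it so integer predictions can be decoded back into labels
inv_decoder = {code: emotion for emotion, code in decoder.items()}

print(inv_decoder[2])  # happiness
```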

6. Pre-Processing text features:


This is the last step in data preparation: we prepare the input text for model training as follows.
  • Tokenization is the process of breaking down a text into smaller units called tokens. In natural language processing (NLP), tokenization is an essential step before any further processing of the text. One way to tokenize text is with the RegexpTokenizer class from the nltk.tokenize module.

  • Lemmatization is the process of converting a word into its base or dictionary form, known as a lemma. In NLP, lemmatization is often used to reduce the inflectional forms of words to a common base form for analysis and comparison. The WordNetLemmatizer class from the nltk.stem module is a popular tool for lemmatization.

  • Stemming is the process of reducing words to their root or base form, known as a stem, by removing suffixes and prefixes. Stemming reduces the number of distinct words that need to be processed and analyzed, which is useful in applications such as information retrieval and text classification. PorterStemmer is a widely used rule-based stemming algorithm: it iteratively applies a set of rules, based on the length and structure of the word, stripping suffixes until the stem is obtained.
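As a quick illustration of the first and last of these steps, the sketch below tokenizes one sentence and stems the tokens (RegexpTokenizer and PorterStemmer need no NLTK corpus downloads, unlike the lemmatizer, which requires the WordNet corpus); the example sentence is made up:

```python
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer

tokenizer = RegexpTokenizer(r'\w+')   # keeps runs of word characters, drops punctuation
stemmer = PorterStemmer()

# A made-up sentence to show the effect of each step
tokens = tokenizer.tokenize("The runners were running faster, weren't they?")
stems = [stemmer.stem(t) for t in tokens]

print(tokens)  # note that "weren't" is split into 'weren' and 't' by the regex
print(stems)   # e.g. 'running' -> 'run'
```

Note that the regex tokenizer splits contractions at the apostrophe, which is one reason the notebook later says these parameters might be adjusted for a given pre-trained model.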



In [ ]:
# Parameter for tokenizer (might be adjusted later based on pre-trained model requirement)
tokenizer = RegexpTokenizer(r'\w+')

# Parameter for lemmatizer (might be adjusted later based on pre-trained model requirement)
lemmatizer = WordNetLemmatizer()

# Parameter for stemmer (might be adjusted later based on pre-trained model requirement)
stemmer = PorterStemmer()

# This function uses BeautifulSoup to remove any HTML tags from the input text and returns the cleaned text.
def remove_html(text):
    soup = BeautifulSoup(text, 'lxml')
    html_free = soup.get_text()
    return html_free

# This function takes in a list of words as input and removes any stopwords (common words that add little meaning to the text). It returns the list of remaining words.
stop_words = set(stopwords.words('english'))  # build the set once instead of per word
def remove_stopwords(text):
    words = [w for w in text if w not in stop_words]
    return words

# This function takes in a list of words and lemmatizes them (i.e., reduces each word to its base form, or lemma) using WordNetLemmatizer. It returns the list of lemmatized words.
def word_lemmatizer(text):
    lem_text = [lemmatizer.lemmatize(i) for i in text]
    return lem_text

# This function takes in a list of words and stems them (i.e., reduces each word to its root form) using PorterStemmer. It returns a string containing the stemmed words separated by spaces.
def word_stemmer(text):
    stem_text = " ".join([stemmer.stem(i) for i in text])
    return stem_text
In [ ]:
# Turning off the df slice-copy warning message
pd.set_option('mode.chained_assignment', None)

# Converting data to string format
data['text'] = data['text'].apply(lambda x: str(x))

# Calling remove_html function on all input text
data['text'] = data['text'].apply(lambda x: remove_html(x))

# Calling tokenizer function on all input text
data['text'] = data['text'].apply(lambda x: tokenizer.tokenize(x))

# Calling remove_stopwords function on all input text
data['text'] = data['text'].apply(lambda x: remove_stopwords(x))

# Calling word_lemmatizer function on all input text
data['text'] = data['text'].apply(lambda x: word_lemmatizer(x))

# Calling word_stemmer function on all input text
data['text'] = data['text'].apply(lambda x: word_stemmer(x))

7. Saving data file:

In [ ]:
data.to_csv('SA_model_data')

8. Conclusion:


The code above demonstrates all actions undertaken to prepare the input data for the next phase: modelling experiments.
Please note that this file is rather an 'open document', as data preparation requirements may vary depending on the pre-trained model used for training.